R3 delta replay picks #2647

Draft
samsja wants to merge 13 commits into main from r3-delta-replay-picks

samsja commented May 27, 2026

No description provided.

samsja and others added 4 commits May 24, 2026 19:47
Squashed from origin/r3-delta (tip 5c94833, which extends the earlier
3799bda with 'Support branched routed expert deltas' for cases where
the routed-experts payload diverges across siblings in a group).

Adapts delta replay to main's deferred routed-experts chunk concat:
first step starts at 0; extended steps use prefix_len - 1; row 0 fills
the boundary, remaining rows append as the new suffix. Bumps router
wheel pin to local-path. Bumps deps/verifiers gitlink to d39cc5876.
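
The offset rule above can be sketched as follows. This is a minimal illustration under assumptions, not the PR's implementation: it treats the routed-experts payload as a per-token list of rows, reads "fills the boundary" as overwriting the row at `prefix_len - 1`, and uses hypothetical names (`apply_delta`, `rows`, `delta`) that do not come from the codebase.

```python
from typing import List, Sequence

Row = Sequence[int]  # hypothetical: expert ids routed for one token


def apply_delta(rows: List[Row], delta: List[Row], first_step: bool) -> List[Row]:
    """Merge one step's routed-experts delta into the accumulated rows."""
    if not delta:
        return list(rows)
    if first_step:
        # First step: the delta starts at offset 0 and is the whole payload.
        return list(delta)
    if not rows:
        raise ValueError("an extended step needs an existing prefix")
    start = len(rows) - 1          # extended steps start at prefix_len - 1
    merged = list(rows[:start])    # keep everything before the boundary
    merged.append(delta[0])        # row 0 fills the boundary position
    merged.extend(delta[1:])       # remaining rows become the new suffix
    return merged


prefix = apply_delta([], [[0, 1], [2, 3]], first_step=True)
extended = apply_delta(prefix, [[4, 5], [6, 7]], first_step=False)
# extended == [[0, 1], [4, 5], [6, 7]]
```

With a two-row prefix, the extended step starts at index 1, so the first delta row lands on the last existing position and the second row is appended.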

Adds four debug configs for router-replay validation.

Co-Authored-By: S1ro1 <matej.sirovatka@gmail.com>

The first-match-wins loop over active_samples picks the wrong sample when
one active prefix is a strict prefix of another. This can happen after a
compaction/rollback step whose prompt is shorter than an existing
sample's prefix and whose completion re-generates the same tokens and
extends past them: the new sample's prefix then starts with the older
sample's prefix, and any later step that extends the new sample also
satisfies the slice check against the older one.

When that happens, extend_sample folds the newer sample's generated
tokens into the older sample as user-input tokens (mask=False,
logprob=0) and leaves the newer sample stale -- a silent Exact-Prefix
invariant violation.

Switch to longest-match: strictly more specific, never worse than
first-match when only one prefix matches.
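
The difference between the two selection rules can be shown with a small sketch. It assumes each active sample is represented only by its token prefix; `first_match` and `longest_match` are hypothetical helper names, and the real code operates on `active_samples` objects rather than bare lists.

```python
from typing import List, Optional, Sequence


def _starts_with(prompt: Sequence[int], prefix: Sequence[int]) -> bool:
    """Slice check: True when the prompt begins with the given prefix."""
    return list(prompt[: len(prefix)]) == list(prefix)


def first_match(prefixes: List[Sequence[int]], prompt: Sequence[int]) -> Optional[int]:
    """Old rule: index of the first active prefix that the prompt starts with."""
    for i, prefix in enumerate(prefixes):
        if _starts_with(prompt, prefix):
            return i
    return None


def longest_match(prefixes: List[Sequence[int]], prompt: Sequence[int]) -> Optional[int]:
    """New rule: index of the longest active prefix that the prompt starts with."""
    best: Optional[int] = None
    for i, prefix in enumerate(prefixes):
        if _starts_with(prompt, prefix):
            if best is None or len(prefix) > len(prefixes[best]):
                best = i
    return best


older = [1, 2, 3]            # existing sample's prefix
newer = [1, 2, 3, 4, 5]      # later sample whose prefix starts with the older one
prompt = [1, 2, 3, 4, 5, 6]  # a step that extends the newer sample

wrong = first_match([older, newer], prompt)    # selects index 0, the older sample
right = longest_match([older, newer], prompt)  # selects index 1, the newer sample
```

When only one prefix matches, both functions return the same index, which is why the switch cannot make selection worse.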

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e239d1)

When more than one active prefix matches a step's prompt, log a warning
with the example id, step index, set of matching prefix lengths, total
active prefixes, and the prompt length. Longest-match still picks the
correct extension; the warning just surfaces the rare ambiguous case so
it's debuggable if it starts showing up in real rollouts (e.g. from
compaction/rollback turns).
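
A sketch of what such a warning could look like, assuming prefixes are plain token lists. The logger name, the function names, and the message wording are all hypothetical; only the logged fields (example id, step index, matching prefix lengths, total active prefixes, prompt length) follow the description above.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger("delta_replay_sketch")  # hypothetical logger name


def matching_prefix_lengths(
    prefixes: List[Sequence[int]], prompt: Sequence[int]
) -> List[int]:
    """Lengths of all active prefixes that the prompt starts with, ascending."""
    return sorted(
        len(p) for p in prefixes if list(prompt[: len(p)]) == list(p)
    )


def warn_if_ambiguous(
    example_id: str,
    step_idx: int,
    prefixes: List[Sequence[int]],
    prompt: Sequence[int],
) -> List[int]:
    """Log a warning when more than one active prefix matches the prompt."""
    lengths = matching_prefix_lengths(prefixes, prompt)
    if len(lengths) > 1:
        logger.warning(
            "example %s step %d: ambiguous prefix match "
            "(matching lengths=%s, active prefixes=%d, prompt length=%d); "
            "extending the longest",
            example_id, step_idx, lengths, len(prefixes), len(prompt),
        )
    return lengths
```

Returning the matched lengths keeps the check easy to test without inspecting log output.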

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit ca38614)

samsja and others added 9 commits May 27, 2026 08:19
Add slurm.cleanup_grace_period_seconds (default 3600) so that when a
component exits — completion, crash, or SIGTERM — the multi-node RL and
inference sbatch teardown sends SIGTERM and then waits up to the grace
period for the remaining processes to exit before force-killing and
releasing the allocation. This gives in-flight work, notably trainer
checkpoint writes, a bounded window to flush. The wait ends as soon as
all processes exit, so it is only an upper bound; set to 0 for the
previous immediate force-kill behavior.

Closes #2664

Co-authored-by: Cursor <cursoragent@cursor.com>

Drop the _seconds suffix; the unit is documented in the field docstring.

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

The previous SIGTERM-then-wait approach didn't help the target case
(inference dies while the trainer is mid-checkpoint on another node):
that teardown is driven by `srun --kill-on-bad-exit=1`, which reaps the
trainer task via SLURM's own KillWait path and never runs our in-task
grace loop.

Instead, on a non-zero exit the failing node now stays alive (signalling
nothing) for the grace period before propagating the exit. Because
--kill-on-bad-exit only fires when a task exits, holding the failing
task keeps peer nodes' checkpointing trainers running untouched until
they flush. Clean (zero-exit) completion is unaffected.
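
The hold-on-failure idea can be sketched as a small wrapper around the component's command. This is a Python illustration under assumptions, not the sbatch template itself; `run_and_hold_on_failure` and the injectable `sleep` parameter are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_and_hold_on_failure(
    cmd: Sequence[str],
    grace_period: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd; on a non-zero exit, stay alive for grace_period before returning."""
    returncode = subprocess.run(list(cmd)).returncode
    if returncode != 0 and grace_period > 0:
        # Signal nothing and simply keep this task alive. Since
        # --kill-on-bad-exit only fires when a task exits, peers keep running
        # until this function returns and the caller exits.
        sleep(grace_period)
    return returncode  # the caller propagates this as the task's exit code
```

A caller would typically finish with `sys.exit(run_and_hold_on_failure(cmd, grace_period))`, so the original exit code still reaches SLURM after the hold. A zero exit skips the sleep entirely.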

Scope to multi_node_rl only; the inference-only template has no trainer
checkpoints to protect, so it reverts to immediate teardown.

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

3 participants